Introduction

This workshop is designed to provide beginners with a foundational understanding of the R programming language. Through a combination of theoretical explanations, hands-on coding exercises, and practical applications, participants will learn how to leverage R for data visualization of cancer biology datasets.

The workshop will cover essential programming concepts and gradually introduce more advanced topics, with a focus on using the ggplot2 package suite for data visualization. The aim of this workshop is to analyse data and create informative plots.

Learning Objectives

Participants will gain the following skills:

  • Proficiency in using R and RStudio for data analysis.
  • Basic R programming skills.
  • Reading datasets using the readr package.
  • Creating various types of plots using the ggplot2 package.

Prerequisites

Before starting this course you will need to ensure that your computer is set up with the required software. If you have any difficulty installing any of this software then please contact one of the trainers for help.

Installing R and RStudio

R and RStudio are separate downloads and installations.

R is the underlying statistical computing environment. The base R system and a very large collection of packages that give you access to a huge range of statistical and analytical functionality are available from CRAN, the Comprehensive R Archive Network.

However, using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive.

Local Installation

You need to install R before you install RStudio.

Windows
  • If you already have R and RStudio installed:

    • Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio.
    • To check which version of R you are using, start RStudio and the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.
  • If you don’t have R and RStudio installed:

    • Download R from the CRAN website.
    • Run the .exe file that was just downloaded
    • Go to the RStudio download page
    • Under Installers select RStudio x.yy.zzz - Windows 10/8/7 (where x, y, and z represent version numbers)
    • Double click the file to install it
    • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
macOS
  • If you already have R and RStudio installed:

    • Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio.
    • To check the version of R you are using, start RStudio and the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it.
  • If you don’t have R and RStudio installed:

    • Download R from the CRAN website.
    • Select the .pkg file for the latest R version
    • Double click on the downloaded file to install R
    • It is also a good idea to install XQuartz (needed by some packages)
    • Go to the RStudio download page
    • Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers)
    • Double click the file to install RStudio
    • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
Linux
  • Follow the instructions for your distribution from CRAN; they provide information on getting the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach as the versions it provides are usually out of date. In any case, make sure you have at least R 4.3.2.
  • Go to the RStudio download page
  • Under Installers select the version that matches your distribution, and install it with your preferred method (e.g., with Debian/Ubuntu sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
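Once R is installed, the minimum version mentioned above can be checked from the console. The following is a small sketch rather than part of the official installation steps; the variable name meets_minimum is our own choice.

```r
# Show the full version string of the R installation that is running
print(R.version.string)

# getRversion() returns a version object that can be compared with a string;
# "4.3.2" is the minimum version mentioned in the installation notes
meets_minimum <- getRversion() >= "4.3.2"
print(meets_minimum)
```

If the last line prints FALSE, a newer version of R should be installed from CRAN.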

Installing R Packages

On this course we will be making use of a brilliant collection of packages designed for data science called the tidyverse that make it much easier and more fun to work with your data. After installing R and RStudio, follow the instructions below to install the tidyverse package suite.

  • After starting RStudio, at the console type: install.packages("tidyverse") (look for the ‘Console’ tab and type at the > prompt)
  • You can also do this by going to Tools -> Install Packages and typing the names of the packages separated by a comma.
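To confirm that the installation worked, one option is to ask R whether the package can be found. This is a minimal sketch, assuming you only want to check and not to install anything; the variable name tidyverse_installed is illustrative.

```r
# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise; unlike library(), it does not attach the package
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (tidyverse_installed) {
  message("tidyverse is installed, version ",
          as.character(packageVersion("tidyverse")))
} else {
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\")")
}
```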

Data

The Metabric study characterized the genomic mutations and gene expression profiles for 2509 primary breast tumours. In addition to the gene expression data generated using microarrays, genome-wide copy number profiles were obtained using SNP microarrays. Targeted sequencing was performed for 2509 primary breast tumours, along with 548 matched normals, using a panel of 173 of the most frequently mutated breast cancer genes as part of the Metabric study.

References:

Both the clinical data and the gene expression values were downloaded from cBioPortal.

We excluded observations for patient tumor samples lacking expression data, resulting in a data set with fewer rows.

R!

R is a powerful programming language and open-source software widely used for statistical computing and data analysis. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It has gained popularity among statisticians, data scientists, researchers, and analysts for its flexibility, extensibility, and robust statistical capabilities.

Why learn R?

Here are several compelling reasons to consider learning R:

  • Statistical Analysis
  • Data Visualization
  • Open Source
  • Community Support
  • Extensibility
  • Integration with Other Languages
  • Data Science and Machine Learning
  • Widely Used in Academia and Industry
  • Continuous Development

Getting Started with R

To begin working with R, users typically install an Integrated Development Environment (IDE) such as RStudio, which provides a user-friendly interface for coding, debugging, and visualizing results. R scripts are written in the R language and can be executed interactively or saved for later use.

A look around RStudio

Open RStudio. You will see four windows (aka panes). Each window has a different function. The screenshot below shows an analogy linking the different RStudio windows to cooking.

Console Pane

On the left-hand side, you’ll find the console. This is where you can input commands (code that R can interpret), and the responses to your commands, known as output, are displayed here. While the console is handy for experimenting with code, it doesn’t save any of your entered commands. Therefore, relying exclusively on the console is not recommended.

History Pane

The history pane (located in the top right window) maintains a record of the commands that you have executed in the R console during your current R session. This includes both correct and incorrect commands.

You can navigate through your command history using the up and down arrow keys in the console. This allows you to quickly recall and re-run previous commands without retyping them.

Environment Pane

The environment pane (located in the top right window) provides an overview of the objects (variables, data frames, etc.) that currently exist in your R session. It displays the names, types, dimensions, and some content of these objects. This allows you to monitor the state of your workspace in real-time.

Plotting Pane

The plotting pane (located in the bottom right window) is where graphical output, such as plots and charts, is displayed when you create visualizations in R. The Plotting pane often includes tools for zooming, panning, and exporting plots, providing additional functionality for exploring and customizing your visualizations.

Help Pane

The help pane (located in the bottom right window) is a valuable resource for accessing documentation and information about R functions, packages, and commands. When you type a function or command in the console and press the F1 key (Mac: fn + F1) the Help pane displays relevant documentation. Additionally, you can type a keyword in the text box at the top right corner of the Help Pane.

Files Pane

The files pane provides a file browser and file management interface within RStudio. It allows you to navigate through your project directories, view files, and manage your file system.

Packages Pane

This pane provides a user-friendly interface for managing R packages. It lists installed packages and allows you to load, unload, update, and install packages.

Viewer Pane

The viewer pane is used to display dynamic content generated by R, such as HTML, Shiny applications, or interactive visualizations.

Working directory

Opening an RStudio session launches it from a specific location. This is the working directory. R looks in the working directory by default to read in data and save files. You can find out what the working directory is by using the command getwd(). This shows you the path to your working directory in the console. On macOS this is in the format /path/to/working/directory and on Windows C:/path/to/working/directory (R displays Windows paths with forward slashes). It is often useful to have your data and R scripts in the same directory and set this as your working directory. We will do this now.

Make a folder for this course somewhere on your computer where you will be able to find it easily. Name the folder, for example, Intro_R_course. Then, to set this folder as your working directory:

In RStudio click on the Files tab and then click on the three dots, as shown below.

In the window that appears, find the folder you created (e.g. Intro_R_course), click on it, then click Open. The files tab will now show the contents of your new folder. Click on More → Set As Working Directory, as shown below.

Note: You can use an RStudio project as described here to automatically keep track of and set the working directory.
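The same steps can also be carried out from the console with getwd() and setwd(). The sketch below is illustrative only: it creates the course folder inside a temporary directory (an assumption made so that the example does not touch your real files) and then restores the original working directory.

```r
# Remember where the session started so we can return afterwards
original_dir <- getwd()

# Create a course folder; tempdir() is used here only for illustration
# (in practice, use the folder you created for this course)
course_dir <- file.path(tempdir(), "Intro_R_course")
dir.create(course_dir, showWarnings = FALSE)

# Set the new folder as the working directory and confirm the change
setwd(course_dir)
print(basename(getwd()))

# Return to the original working directory
setwd(original_dir)
```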

Quarto Document

In RStudio, the Script pane (located at the top left window) serves as a dedicated space for writing, editing, and executing Quarto documents. It is where you compose and organize your R code, making it an essential area for creating reproducible and well-documented analyses.

RStudio provides syntax highlighting in the Script pane, making it easier to identify different components of your code. You can execute individual lines or selections of code from the Script pane. This helps in testing and debugging code without running the entire document.

Open a New Quarto Document

Navigate to File → New File → Quarto Document.

Add a title (e.g. IntroR), your name as Author and save this document as ‘IntroR-doc.qmd’ in your current working directory (e.g. Intro_R_course). A new pane will emerge in the top-left corner.

Download the Quarto document from this link by clicking the download raw file button on the right-hand side, then open it in RStudio. We will use it as a notebook during this workshop.

Comments

In R, any text following the hash symbol # is termed a comment. R disregards this text, considering it non-executable. Comments serve the purpose of documenting your code, aiding your future understanding of specific lines, and highlighting the intentions or challenges encountered.

RStudio makes it easy to comment or uncomment code: select the lines you want to comment (or place the cursor anywhere on a single line), then press Cmd + Shift + C (Mac) or Ctrl + Shift + C (Windows/Linux).

Extensive use of comments is encouraged throughout this course.

# This is a comment. Ignored by R. But useful for me!

Executing Commands

Executing commands or running code is the process of submitting a command to your computer, which does some computation and returns an answer. In RStudio, there are several ways to execute commands:

  • Select the line(s) of code using the mouse, and then click Run at the top right corner of the R text file.
  • Select Run Lines from the Code menu.
  • Click anywhere on the line of code and click Run.
  • Select the line(s) you want to run. Press Cmd + Return (Mac) or Ctrl + Enter (Windows/Linux) to run the selected code.

We suggest the keyboard shortcut, which is fastest. This link provides a list of useful RStudio keyboard shortcuts that can be beneficial when coding and navigating the RStudio IDE.

When you type in, and then run the commands shown in the grey boxes below, you should see the result in the Console pane at bottom left.

Simple Maths in R

We can use R as a calculator to do simple maths.

3 + 5
[1] 8

More complex calculator functions are built in to R, which is the reason it is popular among mathematicians and statisticians.
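A few more calculator-style expressions are sketched below to show operator precedence and some built-in functions. The variable names are our own and are only there so the results can be inspected afterwards.

```r
# Multiplication and division are evaluated before addition and subtraction
precedence <- 2 + 3 * 4      # 14
grouped <- (2 + 3) * 4       # 20, brackets change the order of evaluation

power <- 2^3                 # 8, exponentiation
root <- sqrt(16)             # 4, square root

quotient <- 17 %/% 5         # 3, integer division
remainder <- 17 %% 5         # 2, remainder after division

print(c(precedence, grouped, power, root, quotient, remainder))
```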

Task 1: What is the output of this expression: 6050 * 72 + 124 / 25000 * 0.001 - 433576

Getting Help

In R, the ? operator is used for accessing help documentation for a specific function or topic. When you type ? followed by the name of a function, you get detailed information about that function. For example try:

?mean
View Output
mean R Documentation

Arithmetic Mean

Description

Generic function for the (trimmed) arithmetic mean.

Usage

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)

Arguments

x

An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.

trim

the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

na.rm

a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.

...

further arguments passed to or from other methods.

Value

If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one. If x is not logical (coerced to numeric), numeric (including integer) or complex, NA_real_ is returned, with a warning.

If trim is non-zero, a symmetrically trimmed mean is computed with a fraction of trim observations deleted from each end before the mean is computed.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also

weighted.mean, mean.POSIXct, colMeans for row and column means.

Examples

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))

The above command displays the help documentation for the mean function, providing information about its usage, arguments, and examples.
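The arguments described on the help page can be tried out directly. The sketch below reuses the vector from the Examples section of the help page; the second vector y is made up to show the effect of na.rm.

```r
# Vector from the Examples section of ?mean: one large value (50) pulls
# the ordinary mean upwards
x <- c(0:10, 50)
plain_mean <- mean(x)                  # 8.75
trimmed_mean <- mean(x, trim = 0.10)   # 5.5, lowest and highest value dropped

# A made-up vector with a missing value
y <- c(1, NA, 3)
with_na <- mean(y)                     # NA, because one value is missing
without_na <- mean(y, na.rm = TRUE)    # 2, missing value removed first

print(c(plain_mean, trimmed_mean, with_na, without_na))
```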

Tip

Tab completion is a very useful feature. You can start typing and press Tab to autocomplete code, for example, a function name.

R Packages

Many developers have built thousands of functions and shared them with the R user community to help make everyone’s work easier and more efficient. These functions (short programs) are generally packaged up together in (wait for it) Packages. For example, the tidyverse package is a compilation of many different functions, all of which help with data transformation and visualization. Packages also contain data, which is often included to assist new users with learning the available functions.

Installing Packages

Packages are hosted on repositories, with CRAN (Comprehensive R Archive Network) being the primary repository. To install packages from CRAN, you use the install.packages() function. For example:

install.packages("tidyverse")

This will spit out a lot of text into the console as the package is being installed. Once complete you should have a message:

The downloaded binary packages are in... followed by a long directory name.

To remove an installed package:

remove.packages("tidyverse")

Loading Packages

After installation, you need to load a package into your R session using the library() function. For example:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This makes the functions and datasets from the ‘tidyverse’ package available for use in your current session.

Tip

You only need to install a package once. Once installed, you don’t need to reinstall it in subsequent sessions. However, you do need to load the package at the beginning of each R session using the library() function before you can utilize its functions and features. This ensures that the package is actively available for use in your current session.

Task 2: Install and load the two packages: readr and ggplot2.

Package Documentation

Each package comes with documentation that explains how to use its functions. You can access this information using the help() function or by using ? before the function name:

help(tidyverse)
View Output
tidyverse-package R Documentation

tidyverse: Easily Install and Load the ‘Tidyverse’

Description

The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://www.tidyverse.org.

Author(s)

Maintainer: Hadley Wickham hadley@rstudio.com

Other contributors:

  • RStudio [copyright holder, funder]

See Also

Useful links:

Task 3: Display the documentation of the readr package.

Visualizing Data

The ggplot2 package simplifies the creation of plots using data frames. This package offers a streamlined interface for defining variables to plot, configuring their display, and adjusting visual attributes. Consequently, adapting to changes in the data or transitioning between plot types requires only minimal modifications. This feature facilitates the creation of high-quality plots suitable for publication with minimal manual adjustments.

Reading the Data

In this section, you’ll learn the basics of reading data files into R using the readr package. We will use the read_csv() function from the readr package to import a dataset. CSV, short for Comma Separated Values, is a text format commonly used to store tabular data. Conventionally, the first line contains column headings.

The first argument of the read_csv() function takes the path to the file (or a web link). The following code will download the metabric dataset.

library(readr)
metabric <- read_csv("https://zenodo.org/record/6450144/files/metabric_clinical_and_expression_data.csv")
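To see how read_csv() treats the first line as column headings without downloading anything, the same function can be tried on a tiny local file. This is a minimal sketch: the file is written to a temporary location, and the three patient rows are invented for illustration and are not part of the Metabric data.

```r
library(readr)

# Write a tiny CSV file to a temporary location; the first line holds the
# column headings (the three rows below are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Patient_ID,Age_at_diagnosis,ER_status",
    "P-001,61.5,Positive",
    "P-002,48.2,Negative",
    "P-003,70.1,Positive"
  ),
  csv_path
)

# The first argument of read_csv() is the path to the file;
# show_col_types = FALSE only silences the column type message
toy <- read_csv(csv_path, show_col_types = FALSE)
print(toy)
```

The headings become the column names, and readr guesses a type for each column (text for Patient_ID and ER_status, numeric for Age_at_diagnosis).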

Exploring the Data

In the previous section we imported a dataset into a data frame named metabric. This section demonstrates different ways to view this dataset.

When the name of the object (data frame) is typed, the first few lines along with some information, such as the number of rows are displayed:

metabric
Output
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage Neoplasm_histologic_grade Lymph_nodes_examined_positive Lymph_node_status Cancer_type ER_status PR_status HER2_status HER2_status_measured_by_SNP6 PAM50 3-gene_classifier Nottingham_prognostic_index Cellularity Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
MB-0000 1 75.65 140.50000 LIVING Living NO YES 22 2 3 10 3 Breast Invasive Ductal Carcinoma Positive Negative Negative NEUTRAL claudin-low ER-/HER2- 6.044 NA 4ER+ NA 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728
MB-0002 1 43.19 84.63333 LIVING Living NO YES 10 1 3 0 1 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- High Prolif 4.020 High 4ER+ 2 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570
MB-0005 1 48.87 163.70000 DECEASED Died of Disease YES NO 15 2 2 1 2 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumB NA 4.030 High 3 2 10.041281 9.725825 7.376123 6.404516 6.751566 9.289758 11.698169 10.306115
MB-0006 1 47.68 164.93333 LIVING Living YES YES 25 2 2 3 2 Breast Mixed Ductal and Lobular Carcinoma Positive Positive Negative NEUTRAL LumB NA 4.050 Moderate 9 1 10.404685 10.334979 6.815637 6.869241 7.219187 8.667723 11.863379 10.472181
MB-0008 1 76.97 41.36667 DECEASED Died of Disease YES YES 40 2 3 8 3 Breast Mixed Ductal and Lobular Carcinoma Positive Positive Negative NEUTRAL LumB ER+/HER2- High Prolif 6.080 High 9 2 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961
MB-0010 1 78.77 7.80000 DECEASED Died of Disease NO YES 31 4 3 0 1 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumB ER+/HER2- High Prolif 4.062 Moderate 7 4 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164
MB-0014 1 56.45 164.33333 LIVING Living YES YES 10 2 2 1 2 Breast Invasive Ductal Carcinoma Positive Positive Negative LOSS LumB NA 4.020 Moderate 3 4 10.793832 9.276507 7.720952 5.992706 7.481835 8.365527 11.482627 10.755199
MB-0022 1 89.08 99.53333 DECEASED Died of Other Causes NO YES 29 2 2 1 2 Breast Mixed Ductal and Lobular Carcinoma Positive Negative Negative NEUTRAL claudin-low NA 4.058 Moderate 3 1 10.440667 8.613192 5.592522 6.165420 7.593330 7.872962 10.679403 9.945023
MB-0028 1 86.41 36.56667 DECEASED Died of Other Causes NO YES 16 2 3 1 2 Breast Invasive Ductal Carcinoma Positive Negative Negative GAIN LumB ER+/HER2- High Prolif 5.032 Moderate 9 4 12.521038 10.678266 5.325554 6.220372 6.250678 10.260059 12.148375 10.936002
MB-0035 1 84.22 36.26667 DECEASED Died of Disease NO NO 28 2 2 0 1 Breast Invasive Lobular Carcinoma Positive Negative Negative LOSS Her2 ER+/HER2- High Prolif 3.056 High 3 5 7.536847 11.514514 5.587666 6.411477 5.988243 10.212611 12.804542 13.474571

The dim() function prints the dimensions (rows x columns) of the data frame:

dim(metabric)
Output
[1] 1904   32

This information is also shown in the Environment pane in the top right panel as the number of observations (rows) and variables (columns).

The nrow() function prints the number of rows while ncol() prints the number of columns:

nrow(metabric)
ncol(metabric)
Output
[1] 1904
[1] 32

The View() function gives a spreadsheet-like view of the data frame:

View(metabric)

Clicking the object in the Environment tab also gives a spreadsheet-like view of the object:

The head() function prints the top 6 rows of a data frame:

head(metabric)
Output
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage Neoplasm_histologic_grade Lymph_nodes_examined_positive Lymph_node_status Cancer_type ER_status PR_status HER2_status HER2_status_measured_by_SNP6 PAM50 3-gene_classifier Nottingham_prognostic_index Cellularity Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
MB-0000 1 75.65 140.50000 LIVING Living NO YES 22 2 3 10 3 Breast Invasive Ductal Carcinoma Positive Negative Negative NEUTRAL claudin-low ER-/HER2- 6.044 NA 4ER+ NA 8.929817 9.333972 5.680501 6.338739 5.704157 6.932146 7.953794 9.729728
MB-0002 1 43.19 84.63333 LIVING Living NO YES 10 1 3 0 1 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- High Prolif 4.020 High 4ER+ 2 10.047059 9.729606 7.505424 6.192507 5.757727 11.251197 11.843989 12.536570
MB-0005 1 48.87 163.70000 DECEASED Died of Disease YES NO 15 2 2 1 2 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumB NA 4.030 High 3 2 10.041281 9.725825 7.376123 6.404516 6.751566 9.289758 11.698169 10.306115
MB-0006 1 47.68 164.93333 LIVING Living YES YES 25 2 2 3 2 Breast Mixed Ductal and Lobular Carcinoma Positive Positive Negative NEUTRAL LumB NA 4.050 Moderate 9 1 10.404685 10.334979 6.815637 6.869241 7.219187 8.667723 11.863379 10.472181
MB-0008 1 76.97 41.36667 DECEASED Died of Disease YES YES 40 2 3 8 3 Breast Mixed Ductal and Lobular Carcinoma Positive Positive Negative NEUTRAL LumB ER+/HER2- High Prolif 6.080 High 9 2 11.276581 9.956267 7.331223 6.337951 5.817818 9.719781 11.625006 12.161961
MB-0010 1 78.77 7.80000 DECEASED Died of Disease NO YES 31 4 3 0 1 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumB ER+/HER2- High Prolif 4.062 Moderate 7 4 11.239750 9.739996 5.954311 5.419711 6.123056 9.787085 12.142178 11.433164

Similarly, the tail() function prints the bottom 6 rows of the data frame:

tail(metabric)
Output
Patient_ID Cohort Age_at_diagnosis Survival_time Survival_status Vital_status Chemotherapy Radiotherapy Tumour_size Tumour_stage Neoplasm_histologic_grade Lymph_nodes_examined_positive Lymph_node_status Cancer_type ER_status PR_status HER2_status HER2_status_measured_by_SNP6 PAM50 3-gene_classifier Nottingham_prognostic_index Cellularity Integrative_cluster Mutation_count ESR1 ERBB2 PGR TP53 PIK3CA GATA3 FOXA1 MLPH
MB-7294 4 59.20 82.73333 DECEASED Died of Disease NO NO 15 NA 2 1 2 Breast Invasive Ductal Carcinoma Positive Positive Negative GAIN LumB ER+/HER2- High Prolif 4.03 High 1 2 11.290976 10.846545 7.312247 5.660943 6.190000 9.424235 11.07569 11.567166
MB-7295 4 43.10 196.86667 LIVING Living NO YES 25 NA 3 1 2 Breast Invasive Lobular Carcinoma Positive Positive Negative NEUTRAL LumA ER+/HER2- Low Prolif 5.05 High 3 4 9.591235 9.935178 7.984515 6.753291 6.279207 9.207323 11.28119 11.337601
MB-7296 4 42.88 44.73333 DECEASED Died of Disease NO YES 20 NA 3 1 2 Breast Invasive Ductal Carcinoma Positive Negative Positive GAIN LumB NA 5.04 High 5 6 9.733986 13.753037 5.616082 6.271912 5.999093 9.530390 11.53203 11.626140
MB-7297 4 62.90 175.96667 DECEASED Died of Disease NO YES 25 NA 3 45 3 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumB NA 6.05 High 1 4 11.053198 10.228570 7.478069 6.212256 6.192399 9.540589 11.48276 11.180360
MB-7298 4 61.16 86.23333 DECEASED Died of Other Causes NO NO 25 NA 2 12 3 Breast Invasive Ductal Carcinoma Positive Positive Negative NEUTRAL LumB ER+/HER2- High Prolif 5.05 Moderate 1 15 11.055114 9.892589 8.282737 6.466712 6.287254 10.365901 11.37118 12.827069
MB-7299 4 60.02 201.90000 DECEASED Died of Other Causes NO YES 20 NA 3 1 2 Breast Invasive Ductal Carcinoma Positive Negative Negative NEUTRAL LumB ER+/HER2- High Prolif 5.04 High 10 3 10.696475 10.227787 5.533486 6.180511 6.208784 9.749368 10.86753 9.847856

The colnames() function displays all the column names:

colnames(metabric)
 [1] "Patient_ID"                    "Cohort"                       
 [3] "Age_at_diagnosis"              "Survival_time"                
 [5] "Survival_status"               "Vital_status"                 
 [7] "Chemotherapy"                  "Radiotherapy"                 
 [9] "Tumour_size"                   "Tumour_stage"                 
[11] "Neoplasm_histologic_grade"     "Lymph_nodes_examined_positive"
[13] "Lymph_node_status"             "Cancer_type"                  
[15] "ER_status"                     "PR_status"                    
[17] "HER2_status"                   "HER2_status_measured_by_SNP6" 
[19] "PAM50"                         "3-gene_classifier"            
[21] "Nottingham_prognostic_index"   "Cellularity"                  
[23] "Integrative_cluster"           "Mutation_count"               
[25] "ESR1"                          "ERBB2"                        
[27] "PGR"                           "TP53"                         
[29] "PIK3CA"                        "GATA3"                        
[31] "FOXA1"                         "MLPH"                         
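Both head() and tail() accept an n argument to change how many rows are shown. The sketch below uses mtcars, a small data frame that ships with R, so that it runs without the metabric download; the same calls work on metabric.

```r
# mtcars is a built-in example data frame
print(dim(mtcars))

# Ask for a specific number of rows instead of the default 6
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)

print(first_three)
print(last_two)
```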

Building a Basic Plot

The construction of ggplot graphics is incremental, allowing for the addition of new elements in layers. This approach grants users extensive flexibility and customization options, enabling the creation of tailored plots to suit specific needs.

To build a ggplot, the following basic template can be used for different types of plots.

Three things are required for a ggplot:

1. The data

We first specify the data frame that contains the relevant data to create a plot. Here we are sending the metabric dataset to the ggplot() function.

# render plot background
ggplot(data = metabric)

This command results in an empty gray panel. We must specify how various columns of the data frame should be depicted in the plot.

2. Aesthetics aes()

Next, we specify the columns in the data we want to map to visual properties (called aesthetics or aes in ggplot2). e.g. the columns for x values, y values and colours.

Since we are interested in generating a scatter plot, each point will have an x and a y coordinate. Therefore, we need to specify the x-axis to represent the transcription factor (GATA3) and y-axis to represent the estrogen receptor alpha (ESR1).

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1))

This results in a plot which includes the grid lines, the variables and the scales for x and y axes. However, the plot is empty or lacks data points.

3. Geometric Representation geom_()

Finally, we specify the type of plot (the geom). There are different types of geoms:

geom_blank() draws an empty plot.

geom_segment() draws a straight line. geom_vline() draws a vertical line and geom_hline() draws a horizontal line.

geom_curve() draws a curved line.

geom_line()/geom_path() makes a line plot. geom_line() connects points from left to right and geom_path() connects points in the order they appear in the data.


geom_point() produces a scatterplot.

geom_jitter() adds a small amount of random noise to the points in a scatter plot.

geom_dotplot() produces a dot plot.

geom_smooth() adds a smooth trend line to a plot.

geom_quantile() fits a quantile regression and draws the fitted quantiles as lines.

geom_density() creates a density plot.


geom_histogram() produces a histogram.

geom_bar() makes a bar chart. The height of each bar is proportional to the number of cases in each group.

geom_col() makes a bar chart. The height of each bar is proportional to the values in the data.


geom_boxplot() produces a box plot.

geom_violin() creates a violin plot.


geom_ribbon() produces a ribbon (a band spanning the y interval between ymin and ymax).

geom_area() draws an area plot, which is a line plot filled down to y = 0 (filled lines).

geom_rect(), geom_tile() and geom_raster() draw rectangles.

geom_polygon() draws polygons, which are filled paths.


geom_text() adds text to a plot.

geom_label() adds a label (text drawn on a filled rectangle) to a plot.

The range of geoms available in ggplot2 can be obtained by navigating to the ggplot2 package in the Packages tab pane in RStudio (bottom right-hand corner) and scrolling down the list of functions sorted alphabetically to the geom_... functions.
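Two of the geoms listed above, geom_bar() and geom_col(), are easy to confuse. The sketch below contrasts them on made-up data: samples has one row per case, while totals already holds the counts. The data frames and their values are illustrative assumptions.

```r
library(ggplot2)

# One row per (made-up) sample: geom_bar() counts the rows in each group
samples <- data.frame(
  ER_status = c("Positive", "Positive", "Negative", "Positive")
)
bar_counts <- ggplot(data = samples, mapping = aes(x = ER_status)) +
  geom_bar()

# Counts already computed: geom_col() uses the y values as bar heights
totals <- data.frame(
  ER_status = c("Negative", "Positive"),
  n = c(1, 3)
)
col_values <- ggplot(data = totals, mapping = aes(x = ER_status, y = n)) +
  geom_col()

print(bar_counts)
print(col_values)
```

Both plots show the same two bars; the difference is only in whether ggplot2 does the counting or the data supplies the heights.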

Since we are interested in creating a scatter plot, the geometric representation of the data will be in point form. Therefore we use the geom_point() function.

To plot the expression of estrogen receptor alpha (ESR1) against that of the transcription factor, GATA3:

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) + geom_point() 

Notice that we use the + sign to add a layer of points to the plot. This concept bears resemblance to Adobe Photoshop, where layers of images can be rearranged and edited independently. In ggplot, each layer is added over the plot in accordance with its position in the code using the + sign.
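
One way to see the layering at work is to build a plot step by step and store each stage in a variable. The sketch below uses a made-up data frame (toy, gene_a and gene_b are hypothetical names standing in for two expression columns).

```r
library(ggplot2)

# Hypothetical toy data standing in for two expression columns
toy <- data.frame(
  gene_a = c(1, 2, 3, 4, 5),
  gene_b = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

# Start with the data and aesthetic mappings only (an empty plot)
p <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b))

# Add a layer of points, then draw a line layer on top of it
p_points <- p + geom_point()
p_both <- p_points + geom_line()

p_both
```

Because the line layer is added after the point layer, it is drawn over the points; swapping the two geoms in the code swaps the drawing order.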

Customizing Plots

Adding Colour

The above plot could be made more informative. For instance, the additional information regarding the ER status (i.e., ER_status column) could be incorporated into the plot. To do this, we can utilize aes() and specify which column in the metabric data frame should be represented as the color of the points.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
    geom_point(mapping = aes(colour = ER_status)) 

Notice that we specify the colour = ER_status argument in the aes() mapping inside the geom_() function instead of ggplot() function.

To colour points based on a numeric variable, which ggplot2 treats as continuous, for example the tumour grade (Neoplasm_histologic_grade column):

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = Neoplasm_histologic_grade)) 

In ggplot2, a continuous colour scale (a gradient) is used for numeric variables, while discrete or categorical variables are represented using a set of distinct colours.
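
A numeric column is given a gradient even when it only holds a few distinct values. If such a column is really a category, one option is to convert it with factor() inside aes(). The sketch below illustrates this with a made-up data frame (toy and its columns are hypothetical).

```r
library(ggplot2)

# Hypothetical toy data; `grade` is stored as a number
toy <- data.frame(
  gene_a = c(1, 2, 3, 4, 5, 6),
  gene_b = c(2, 4, 5, 4, 7, 8),
  grade  = c(1, 2, 3, 1, 2, 3)
)

# Numeric column: ggplot2 uses a continuous colour gradient
p_gradient <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point(aes(colour = grade))

# Same column converted to a factor: one distinct colour per level
p_discrete <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point(aes(colour = factor(grade)))

p_gradient
p_discrete
```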

Note that some patient samples lack expression values, leading ggplot2 to remove those points with missing values for ESR1 and GATA3.
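
To find out how many points will be dropped before plotting, the missing values can be counted directly. The following is a minimal sketch with a made-up data frame (toy, gene_a and gene_b are hypothetical); it also shows the na.rm argument, which drops the same rows without printing a warning.

```r
library(ggplot2)

# Hypothetical toy data with two rows that contain a missing value
toy <- data.frame(
  gene_a = c(1, 2, NA, 4, 5),
  gene_b = c(2, NA, 6, 8, 10)
)

# Count rows where either value is missing; these cannot be drawn
n_missing <- sum(is.na(toy$gene_a) | is.na(toy$gene_b))
n_missing

# na.rm = TRUE removes the incomplete rows silently
p_quiet <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point(na.rm = TRUE)

p_quiet
```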

Adding Shape

Let’s add shape to points.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) + 
  geom_point(mapping = aes(shape = `3-gene_classifier`))
Warning: Removed 204 rows containing missing values or values outside the scale range
(`geom_point()`).

Note that some patient samples have not been classified and ggplot has removed those points with missing values for the three-gene classifier.

The shape argument allows you to customize the appearance of all data points by assigning an integer (0 to 25), each of which is associated with a predefined shape.

To use asterisks instead of points in the plot:

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) + 
  geom_point(shape = 8)

It would be useful to be able to change the shape of all the points. We can do so by setting the shape to a single value rather than mapping it to one of the variables in the data set. This has to be done outside the aesthetic mappings (i.e. outside the aes() bit), as above.

Aesthetic Setting vs. Mapping

Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters (outside aes()). We map an aesthetic to a variable (e.g., aes(shape = `3-gene_classifier`)) or set it to a constant (e.g., shape = 8). If you want the appearance to be governed by a variable in your data frame, put the specification inside aes(); if you want to override the default shape, size or colour, put the value outside of aes().

# shape outside aes()
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(shape = 8)
# shape inside aes()
ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(aes(shape = `3-gene_classifier`))
Warning: Removed 204 rows containing missing values or values outside the scale range
(`geom_point()`).

The above plots are created with similar code, but have rather different outputs. The first plot sets the shape to a single value and the second plot maps (not sets) the shape to the three-gene classifier variable.
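
A common slip that follows from this distinction is putting a constant inside aes(). The sketch below, using a made-up data frame (toy and its columns are hypothetical), contrasts setting a colour with accidentally mapping it.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(gene_a = c(1, 2, 3), gene_b = c(3, 1, 2))

# Setting: the constant sits outside aes(), so every point is drawn in blue
p_set <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point(colour = "blue")

# Mapping by mistake: "blue" is treated as a categorical variable with a
# single value, so ggplot2 picks a colour from its default palette and
# adds a legend
p_map <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point(aes(colour = "blue"))

p_set
p_map
```

If the points come out in an unexpected colour and a legend with a single entry appears, a constant inside aes() is a likely cause.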

It is usually preferable to use colours to distinguish between different categories but sometimes colour and shape are used together when we want to show which group a data point belongs to in two different categorical variables.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(aes(colour = PAM50, shape = `3-gene_classifier`))
Warning: Removed 204 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding Size and Transparency

We can adjust the size and/or transparency of the points.

Let’s first increase the size of points.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = PAM50), size = 2)

Note that here we add the size argument outside of the aesthetic mapping.

Transparency can be useful when we have a large number of points as we can more easily tell when points are overlaid, but like size, it is not usually mapped to a variable and sits outside the aes().

Let’s change the transparency of points.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = `3-gene_classifier`), alpha = 0.5) 

Adding Layers

We can add another layer to this plot using a different geometric representation (or geom_ function) we discussed previously.

Let’s add a trend line to this plot using the geom_smooth() function, which provides a summary of the data.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Note that the shaded area surrounding the blue line represents the confidence interval (95% by default) around the fitted model.
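
Both the fitting method and the shaded band can be controlled through arguments to geom_smooth(). The following is a minimal sketch on a made-up data frame (toy and its columns are hypothetical) that fits a straight line and hides the band.

```r
library(ggplot2)

# Hypothetical toy data with a roughly linear relationship
toy <- data.frame(
  gene_a = c(1, 2, 3, 4, 5, 6, 7, 8),
  gene_b = c(1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2)
)

# method = "lm" fits a straight line; se = FALSE hides the shaded band
p_lm <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_lm
```

To keep the band but change its width, the level argument can be used instead, for example geom_smooth(level = 0.99).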

Let’s make the plot look a bit prettier by reducing the size of the points and making them transparent. We’re not mapping size or alpha to any variables, just setting them to constant values, and we only want these settings to apply to the points, so we set them inside geom_point().

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() 
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Let’s add some colour to the plot.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1, colour = ER_status)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
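
Because colour is mapped inside ggplot() here, it applies to every layer, so geom_smooth() fits a separate trend line for each group. Mapping colour inside geom_point() only would colour the points while keeping a single overall trend line. The sketch below shows both variants on a made-up data frame (toy and its columns are hypothetical) and uses method = "lm" purely to keep the example small.

```r
library(ggplot2)

# Hypothetical toy data with two groups
toy <- data.frame(
  gene_a = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  gene_b = c(1.1, 1.9, 3.2, 4.0, 4.8, 5.0, 4.1, 2.9, 2.2, 1.0),
  status = rep(c("Positive", "Negative"), each = 5)
)

# Colour mapped in ggplot(): inherited by both layers, one line per group
p_global <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b, colour = status)) +
  geom_point() +
  geom_smooth(method = "lm")

# Colour mapped in geom_point() only: coloured points, one overall line
p_local <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b)) +
  geom_point(mapping = aes(colour = status)) +
  geom_smooth(method = "lm")

p_global
p_local
```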

Adding Labels

By default, ggplot uses the column names specified inside aes() as the axis labels. We can change this using the x = and y = arguments in the labs() function.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  labs(x = "GATA3 Expression",
       y = "ESR1 Expression")

You can also add a title, a subtitle, a caption or a tag.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
  geom_smooth() +
  labs(
    title = "Expression of estrogen receptor alpha against the transcription factor",
    subtitle = "ESR1 vs GATA3",
    caption = "This is a caption",
    tag = "Figure 1",
    x = "GATA3 Expression",
    y = "ESR1 Expression")

Themes

Themes control the overall appearance of the plot, including background color, grid lines, axis labels, and text styles. ggplot offers several built-in themes, and you can also create custom themes to match your preferences or the requirements of your publication. The default theme has a grey background.

ggplot(data = metabric, mapping = aes(x = GATA3, y = ESR1)) +
  geom_point(mapping = aes(colour = ER_status), size = 0.5, alpha = 0.5) +
  geom_smooth() + theme_bw()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Try these themes yourselves: theme_classic(), theme_dark(), theme_grey() (default), theme_light(), theme_linedraw(), theme_minimal(), theme_void() and theme_test().
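
Individual elements of a built-in theme can be adjusted with the theme() function. The following is a minimal sketch on a made-up data frame (toy and its columns are hypothetical); the particular settings are only examples.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  gene_a = c(1, 2, 3, 4),
  gene_b = c(2, 3, 5, 4),
  status = c("Positive", "Negative", "Positive", "Negative")
)

# Start from a complete theme, then override single elements with theme()
p_custom <- ggplot(data = toy, mapping = aes(x = gene_a, y = gene_b, colour = status)) +
  geom_point() +
  labs(title = "Toy example") +
  theme_bw() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

p_custom
```

The theme() call is placed after theme_bw() because a complete theme added later would replace the individual settings.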

Different Types of Plots

Bar chart

The metabric study redefined how we think about breast cancer by identifying and characterizing several new subtypes, referred to as integrative clusters. Let’s create a bar chart of the number of patients whose cancers fall within each subtype in the metabric cohort.

geom_bar() is the geom used to plot bar charts. It requires a single aesthetic mapping of the categorical variable of interest to x.

ggplot(data = metabric) +
  geom_bar(aes(x = Integrative_cluster))

The dark grey bars are a bit ugly - what if we want each bar to be a different colour?

ggplot(data = metabric) +
  geom_bar(aes(x = Integrative_cluster, colour = Integrative_cluster))

Colouring the edges wasn’t quite what we had in mind. Look at the help for geom_bar to see what other aesthetic we should have used.

ggplot(data = metabric) +
  geom_bar(aes(x = Integrative_cluster, fill = Integrative_cluster))

Box plot

Box plots (or box & whisker plots) are a particular favourite seen in many seminars and papers. Box plots summarize the distribution of a set of values by displaying the median (i.e. middle-ranked value) and the range of the middle 50% of values (the inter-quartile range, IQR, spanning Q1 to Q3). The whiskers extend above and below the IQR box to the most extreme values that lie within Q3 + (1.5 x IQR) and Q1 - (1.5 x IQR) respectively. Values beyond the whiskers are drawn as individual points (outliers).
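
The quantities behind a box plot can be worked out by hand in base R. The sketch below uses a made-up vector of values (all names are hypothetical) and R's default quantile definition.

```r
# Hypothetical values with one extreme observation
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles and inter-quartile range (the box)
q1 <- unname(quantile(values, 0.25))
q3 <- unname(quantile(values, 0.75))
iqr <- q3 - q1

# Limits at 1.5 x IQR beyond the box
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme values that lie within the limits
lower_whisker <- min(values[values >= lower_limit])
upper_whisker <- max(values[values <= upper_limit])

# Anything beyond the limits is drawn as an individual point
outliers <- values[values < lower_limit | values > upper_limit]

c(q1 = q1, q3 = q3, lower_whisker = lower_whisker, upper_whisker = upper_whisker)
outliers
```

Here the box spans 3.25 to 7.75, the whiskers end at 1 and 9, and the value 30 is an outlier.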

To create a box plot from the metabric dataset:

ggplot(data = metabric, mapping = aes(x = ER_status, y = GATA3)) +
  geom_boxplot()

Let’s try a colour aesthetic to also look at how GATA3 expression differs between HER2 positive and negative tumours.

ggplot(data = metabric, mapping = aes(x = ER_status, y = GATA3, colour = HER2_status)) +
  geom_boxplot() 

Violin plot

A violin plot is used to visualize the distribution of a numeric variable across different categories. It combines aspects of a box plot and a kernel density plot.

The width of the violin at any given point represents the density of data at that point. Wider sections indicate a higher density of data points, while narrower sections indicate lower density. By default, violin plots are symmetric.

ggplot(data = metabric, aes(y = GATA3, x = ER_status, colour = HER2_status)) + 
  geom_violin()

Histogram

The geom for creating histograms is, rather unsurprisingly, geom_histogram().

ggplot(data = metabric) +
  geom_histogram(aes(x = Age_at_diagnosis))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The message hints at picking a better bin size by specifying the binwidth argument.

ggplot(data = metabric) +
  geom_histogram(aes(x = Age_at_diagnosis), binwidth = 5)

Or we can set the number of bins.

ggplot(data = metabric) +
  geom_histogram(aes(x = Age_at_diagnosis), bins = 20)

These histograms are not very pleasing, aesthetically speaking - how about some better aesthetics?

ggplot(data = metabric) +
  geom_histogram(
    aes(x = Age_at_diagnosis), 
    bins = 20, 
    colour = "darkblue", 
    fill = "grey")

Saving plot images

Use ggsave() to save the last plot you displayed.

ggsave("age_at_diagnosis_histogram.png")

You can alter the width and height of the plot and can change the image file type.

ggsave("age_at_diagnosis_histogram.pdf", width = 20, height = 12, units = "cm")
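
ggsave() can also save a specific plot rather than the last one displayed, by passing a stored plot to its plot argument. The following is a minimal sketch on a made-up data frame (toy, age_histogram and out_file are hypothetical names); it writes to a temporary file so that nothing is left in the working directory.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(age = c(34, 45, 51, 58, 62, 67, 70, 73))

# Store the plot in a variable instead of only displaying it
age_histogram <- ggplot(data = toy) +
  geom_histogram(aes(x = age), binwidth = 10)

# Temporary output path; replace with a real file name in practice
out_file <- tempfile(fileext = ".pdf")

# Save the stored plot explicitly via the plot argument
ggsave(out_file, plot = age_histogram, width = 20, height = 12, units = "cm")

file.exists(out_file)
```

For raster formats such as PNG, ggsave() also accepts a dpi argument to control the resolution.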

Exercise

  1. Generate the following plot.

You are required to:

  • Specify the dataset as metabric
  • Plot the Integrative_cluster column on the x-axis and the ESR1 column on the y-axis.
  • Use a suitable geom function.
  • Use the labs(x = ?, y = ?) function and replace ? with the correct x and y labels.

  2. The default theme has the characteristic grey background which isn’t particularly suitable for printing on paper. We can change to one of a number of alternative themes available in the ggplot2 package. Add a theme to create the following plot.


This content was adapted from the Introduction to R: exploring the tidyverse course materials.